HTML Document Analysis for Information Extraction
Abstract
Today's World Wide Web contains a vast amount of information stored in HTML documents. However, HTML primarily describes the appearance of documents and provides no facilities for describing the structure of the data they contain. In this paper we propose a model of a Web site that describes the logical structure of the contained data. Furthermore, we propose methods for creating such a model by analyzing the appearance and structure of HTML documents.
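As a rough illustration of deriving logical structure from presentation, the sketch below nests the headings of an HTML page into a tree. It is not the method proposed in the paper, which the abstract does not detail. It assumes that heading tags (`h1` to `h6`) can stand in for visual prominence, and the names `WEIGHTS`, `OutlineParser`, and `build_outline` are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser

# Hypothetical visual weights (assumption, not from the paper):
# a larger number means a more prominent element.
WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 1}


class OutlineParser(HTMLParser):
    """Collects (weight, text) pairs for heading elements in document order."""

    def __init__(self):
        super().__init__()
        self.items = []       # (weight, text) pairs
        self._current = None  # weight of the heading being read, if any
        self._buffer = []     # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self._current = WEIGHTS[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in WEIGHTS and self._current is not None:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._buffer).split())
            self.items.append((self._current, text))
            self._current = None


def build_outline(html):
    """Nest headings into a tree of {'text', 'weight', 'children'} nodes."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    # The root outranks every heading, so the stack never becomes empty.
    root = {"text": None, "weight": 7, "children": []}
    stack = [root]
    for weight, text in parser.items:
        # Pop until the top of the stack is more prominent than this item.
        while stack[-1]["weight"] <= weight:
            stack.pop()
        node = {"text": text, "weight": weight, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


sample = (
    "<h1>Products</h1><p>intro</p>"
    "<h2>Laptops</h2><h2>Phones</h2>"
    "<h1>Contact</h1>"
)
outline = build_outline(sample)
```

Here `outline` has two top-level nodes, "Products" (with "Laptops" and "Phones" as children) and "Contact". A real page would need richer visual cues, such as font size or position, which this sketch deliberately leaves out.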
Similar resources
Visual HTML Document Modeling for Information Extraction
Current methods of information extraction from HTML documents are mostly based on discovering patterns in the HTML code that are expected to identify particular information in the document. However, this approach has several common problems, caused mainly by the great variability of HTML and related technologies and by the fact that HTML constructions have no direct binding...
Information Extraction from HTML Documents Based on Logical Document Structure
The World Wide Web is the largest Internet source of information across a broad range of areas. Web documents are mostly written in the Hypertext Markup Language (HTML), which contains no means for semantic description of the content, so the contained information cannot be processed directly. Current approaches to information extraction from HTML are mostly based on wrap...
Extracting structure from HTML documents for language visualization and analysis
Document analysis is shifting from document image analysis to the analysis of electronic documents, especially those available on the Web in HTML and PDF formats. We are analyzing a 250M-word collection of HTML-formatted papers from the American Society for Microbiology with the ultimate goal of doing query answering and information extraction. Each document is converted to a sequence of token-...
Categorisation of web documents using extraction ontologies
Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance d...
A DOM-based Anchor-Hop-T Method for Web Application Information Extraction
To fuse information about electronic products, the widely adopted approach is to extract information from the HTML structure of business websites with deep data processing. However, modeling a Web application is hard because the data in HTML is semi-formal and is displayed as a DOM (Document Object Model) tree when XML schema is used for data analysis. How to understand and...